Bayesian approach to clustering real value, hypergraph and bipartite graph data: 

solution via variational methods 
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Data clustering, including problems such as finding network communities, can be put into a 
systematic framework by means of a Bayesian approach. The application of Bayesian approaches to 
real problems can be, however, quite challenging. In most cases the solution is explored via Monte 
Carlo sampling or variational methods. Here we work further on the application of variational 
methods to clustering problems. We introduce generative models based on a hidden group structure 
and prior distributions. We extend previous attends by Jaynes, and derive the prior distributions 
based on symmetry arguments. As a case study we address the problems of two-sides clustering real 
value data and clustering data represented by a hypergraph or bipartite graph. From the variational 
calculations, and depending on the starting statistical model for the data, we derive a variational 
Bayes algorithm, a generalized version of the expectation maximization algorithm with a built in 
penalization for model complexity or bias. We demonstrate the good performance of the variational 
Bayes algorithm using test examples. 
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I. INTRODUCTION 



Mixture models provide an intuitive statistical repre- 
sentation of datasets structured in groups, clusters or 
classes A complex dataset is decomposed into the 
superposition of simpler datasets. The inverse problem 
consists in determining the group decomposition and the 
statistical parameters characterizing each group. For a 
fixed number of groups the expectation maximization 
(EM) algorithm provides a recursive solution to the in- 
verse problem [2(]. The estimation of the right number 
or groups has been, however, a great challenge. Cor- 
rections such as the Arkaike information criterion (AIC) 
[H and the Bayesian information criterion (BIC) Q have 
been derived, penalizing model complexity and overfit- 
ting. Yet, the number of groups estimated from these 
criteria is in general unsatisfactory. 

In contrast, a Bayesian approach would not attempt 
to estimate what is the "optimal" number of groups, but 
instead average over models with a different number of 
groups |5J . The Bayesian approach is becoming a popular 
technique to solve problems in data analysis, model selec- 
tion and hypothesis testing [1, 0, @] ■ Many of the original 
ideas come from the early work of Jeffreys @, but it is 
just recently that they are starting to be used widely 
Hi S Hi • The application of Bayesian approaches to real 
problems can be, however, quite challenging. In most 
cases the solution is explored via Monte Carlo sampling 
@, Q or variational methods @, E3, EH • The application 
of variational methods to Bayesian problems results in 
the variational Bayes (VB) algorithm 0, 03. The VB 
algorithm is a set of self-consistent equations analog to 
the EM algorithm. They can be solved recursively ob- 
taining an approximate solution to the inverse inference 
problem. These methods have been applied, for example, 
to Gaussian mixture models for real value data [l], [12] , 
Dirichlet mixture models for categorical data [HI and the 



problem of finding graph modules [14j . 

Here we further study the use of variational methods 
in the context of Bayesian approaches, focusing on data 
clustering problems. I the first two sections we review 
the Bayesian approach. In Section |TT] we revisit the con- 
nection between the Bayesian formulation and statistical 
mechanics. In section IIIII we introduce the generalities 
of generative models with a hidden structure at the sam- 
ples side and at both the samples and variables side. In 
Section lTVl we extend the previous work by Jaynes [ToT ] de- 
riving prior distributions based on symmetry properties. 
We report a correction to his result for the model with a 
location and scale parameter and an extension of his re- 
sult for the binomial model to the multinomial model. In 
the following Sections we study the problem of two-sides 
clustering real value data and of clustering data repre- 
sented by a hypergraph or bipartite graph. Depending 
on our starting statistical model, we obtain a VB algo- 
rithm. Because of its Bayesian root, the VB algorithms 
have a built in correction for model complexity or bias 
and, therefore, they do not require the use of additional 
complexity criteria. The performance of the VB algo- 
rithms is tested in some examples, obtaining satisfactory 
results whenever there is a significant distinction between 
the groups. 



II. BAYESIAN APPROACH AND 
VARIATIONAL SOLUTION 

The Bayesian approach is a systematic methodology 
to interpret complex datasets and to evaluate model hy- 
pothesis. Its main ingredients or steps are: given a 
dataset D, (i) introduce a statistical model with model 
parameters, (f>, (ii) write down the likelihood to ob- 
serve the data given the proposed model and parame- 
ters, P{D\<t>), (iii) determine the prior distribution for 
the model parameters based on our current knowledge, 
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P(4>), and, finally, (iv) invert the statistical model of the 
data given the likelihood and prior distribution to obtain 
the posterior distribution of the model parameters given 
the model and data, P(<p\D). The latter step is based on 
Bayes rule 

P{4>\D) = ^P{D\4>)P{<t>) (1) 

where 



Z = P{D) = J d9P(D\0)P(0) . (2) 

Having obtained the distribution of the model parame- 
ters, at least formally, we can determine other magni- 
tudes. For example, the average of a quantity A(<p) is 
given by 



temperature of the system, the temperature being ex- 
pressed in units of the Boltzman constant . Minus the 
average log likelihood plays the role of the internal en- 
ergy, the Kullback-Leibler divergence of Q(<p) plays the 
role of the entropy and temperature equals one. 

Equation ([5]) emphasizes the two components deter- 
mining the best choice of variational distribution Q((f>): 
better fit to the data and model bias. How well the data 
is fitted is quantified by the internal energy U (JHJ. To 
achieve the best fit, or internal energy ground state, Q{4>) 
should be concentrated around the regions of the param- 
eter space where P(D\cf>) is maximum. The best choice 
in this respect will by the maximum likelihood estimate 
(MLE) 

Qmle(0) =(5(0-0*) (8) 

where 



(A(<f>)) = J d<f>P(<f>\D)A(0) . (3) 

In practice calculating (|2| or Q is a formidable task. 
A very powerful approximation scheme is the variational 
method 0, fic| . The main idea of the variational method 
is to approximate the generally difficult to handle dis- 
tribution P{<f)\D) by a distribution Q((f>\D) of a more 
tractable form. In the following we omit the dependency 
of Q on D and just write Q(4>). Given Q((j>) we can obtain 
a bound for F = — In Z using Jensen's inequality 

F ::I/^ W ») 

The latter equation can be rewritten as 0] 



(f>* = m&xP(D\<j)) . (9) 

<t> 

In the opposite extreme, when no data is presented to 
us, the best distribution is that maximizing the Kullback- 
Leibler divergence relative to the prior distribution. This 
maximum entropy (ME) solution is the prior distribution 
itself 

Qme{4>) = P{4>) ■ (10) 

In general, the drive to better fit the data is opposed by 
the tendency to obtain the least unbiased model. The 
variational solution is therefore in the middle between 
the one extreme of biased models fitting the data very 
well and completely unbiased models giving a bad fit to 
the data. It is obtained after minimizing ([5]) with respect 
to Q(4>) over a restricted class of functions. This varia- 
tional solution Q{4>) represents the closest distribution to 
P(4>\D) within the class of functions considered. 



F < U — TS (5) 

where T = 1, 

U = - J d4>Q{4>)\nP{D\(t>) (6) 
is minus the average log likelihood and 

5 = -y#Q(0)ln||| (7) 

is the Kullback-Leibler divergence of Q(4>) relative to the 
prior distribution P(4>) [l6j |. Equation © resembles the 
usual free energy in statistical mechanics: F = U — TS, 
where U, S and T are the internal energy, entropy and 



III. STATISTICAL MODEL WITH A 
POPULATION STRUCTURE 

In this section we present the generalities of statistical 
models with a first level population structure. Similar 
models has been studied in [l3|, H3 |. Our working hy- 
pothesis is that there is a hidden population structure, 
characterized by the subdivision of the population sam- 
ples into groups. We assume that we are given a dataset 
D which, in some way to be determined, reflects the pop- 
ulation structure. The problem consist in inferring this 
hidden structure and the associated model parameters 
from the data. To tackle this problem we introduce a 
statistical model with a built in population structure as 
a generative model of the data. The population structure 
and the model parameters are then inferred solving the 
inverse problem. More precisely 
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(i) We consider a population composed of n elements 
divided in K groups. 

(ii) The samples assignment to groups is generated by 
a multinomial model with probabilities irk, k = 
1, . . . , K. Denoting by g t the group to which the 
i-th sample belongs, we obtain 

n 

*>(<?k) = ]]>«• (ii) 

(iii) Given the group assignments gt , and depending on 
the dataset, we write down the likelihood P(D\g, 9) 
to observe the data parametrized by the parameter 
set 9. 

(iv) Putting all this together we obtain the posterior 
distribution 



P{<t>\D) = ^P(D\g, e)P(g\n)P(6)P(w)P(K) , (12) 

where = (g,9,n,K) and P{6), P{n) and P{K) 
are the prior distributions of 9, n and K. 

The form of the prior distributions, except for P(K), is 
the subject of the next section. The distribution P(K) 
is irrelevant for problems with large datasets. The differ- 
ence between the log-likelihood of models with different 
values of K is in general of the order of the dataset size 
and, as a consequence, the contribution of \nP(K) is neg- 
ligible. Thus, in the following sections we simply neglect 
the contribution given by P(K). Finally, we specify the 
likelihood P(D\g, 9) when addressing specific problems. 

In some cases we are going to assume that the vari- 
ables in our dataset are also divided in groups. Here we 
consider a set of m variables divided in L groups. The 
variables assignment to groups is generated by a multino- 
mial model with probabilities Ki, I = 1, . . . , L. Denoting 
by c ji 3 = 1) ■ ■ ■ i m i the variable group to which variable 
j belongs we can then write 



IV. PRIOR DISTRIBUTIONS 

The choice of the prior distribution P(<fr) is proba- 
bly one of the less obvious topics in Bayesian analy- 
sis. Currently the predominant choice is the use of 
conjugate priors. The form of conjugate priors is in- 
dicated by the likelihood, making the prior selection 
less ambiguous. For example, the binomial likelihood 
P(n\p) oc p n (l — p) N ~ n suggests a beta distribution for 
P(p\n). Furthermore, by choosing a beta distribution 

as a prior, P(p) oc p Q_1 (l — p)^ -1 , the posterior distri- 
bution remains a beta distribution, but with exponents 
a = a + n and /3 = /3 + N — n. In this sense, the beta 
distribution is the conjugate prior of the binomial likeli- 
hood. A list of conjugate priors relevant for this work is 
provided in Table IIVI 

Yet, the fact that the form of conjugate priors is sug- 
gested by the likelihood does not demonstrate that they 
are the correct choice of priors. Moreover, even if we ac- 
cept their use, it is not clear what is the correct choice for 
the prior distribution parameters, e.g. a and /?. Different 
methods have been proposed to determine these param- 
eters. In general they are based on a posteriori analyzes, 
e.g. calculations, making use of the data in some way 
or another. Such methods violate, however, the concept 
of prior distribution, defined as the distribution of the 
model parameters in the absence of the data. 

An alternative approach is that by Jaynes '15| . Ac- 
cording to Jaynes, in the absence of any data, the priors 
should be solely determined based on the symmetries and 
constraints of the problem under consideration. In this 
work we make use of Jaynes's approach to determine the 
prior distribution. Below we derive Jaynes's priors for 
the cases relevant for this work. 



A. Prior for a model with location and scale 
parameters 

Consider a problem where the data consists of equally 
distributed random variables Xi, i, . . . ,n, taking real val- 
ues. Furthermore let us assume that the likelihood has 
the form 



p(ci«)=n«c J ( i3 ) 

After adding this variable group structure, the posterior 
distribution (TT2"|) is replaced by 

P{4>\D) = ^P(D\g,c,e)P(g\Tr)P(c\K) 

x P(tt)P(k)P(9)P(K) , (14) 

where <p = (g, c,9,it, k, K) and P(n) is the prior distri- 
bution of K. 



P(X| M) a) = n/(^^)i, (15) 

where f{x) is a probability density function in the real 
line and /x and a are a location and scale parameter re- 
spectively. Our task consist in determining the prior dis- 
tribution of fi and a. Now, suppose Xi represent posi- 
tions, which could be measured from difference systems 
of reference and using different units. In this context 
the prior distribution should be the same regardless of 
our system of reference and units. More precisely, our 
system is invariant under the transformations 
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Model 


Likelihood 


Conjugate prior 


Invariant prior 


Renormalization limit 


Binomial 
Multinomial 

Normal 


CK(i-p)"-" 


D(^;7) = BknLi< fe - 1 


const.p _1 (l — 
const. n,=i 

const. 
CT 2 


& -» 0, /3 -> 
7fc -» 

5^0 



TABLE I: Prior distributions: Examples of model likelihoods and their associated conjugated priors and invariant pri- 
ors. Beta(a;;a,fo) denotes the probability density function of the beta distribution, where B(a, b) is the beta function. 
D(x;y) denotes the probability density function of the Dirichlet distribution, the generalized beta distribution, where 
B(7) = r(5^ fc 7fe)/ Ylk r(7fc) is the generalized beta function. The renormalization limit column indicates the limit in which 
the conjugate prior approaches the invariant prior. 



x = a(x + b) 
fi' = a(/z + b) 
a 1 = aa 



(16) 



where b represents a translation and a a change of scale 
or units. The likelihood is invariant under these transfor- 
mations and so must be the prior distribution. Therefore, 



P(jjl\ a')dfi'da' = P(fi, a)dfxda (17) 
The solution to this functional equation is 



const. 



(18) 



This analysis was first reported by Jaynes [l5|. He ob- 
tained, however, P(fx,cr) oc 1/u. This discrepancy is 
rooted in the fact that Jaynes did not take into account 
that the location parameter /i follows the same rules than 
x upon the translation and scale transformations. He as- 
sumed p! = fi + b [HI while the correct transformation is 
p! = a(n + b) (HU). 



B. Prior for the multinomial model 



Consider the multinomial model with K states 



P(n\ir) 



(EaLi 



in- 



\Ai=i ^ 



I K 

: n 



(19) 



k=l 



where n k is the number of times state k was observed 
and 7tfe is the probability to observe state k in one trial, 
< 7Tfc < 1 and X)fe=i n k = 1- Here we extend the 
approach followed by Jaynes for the binomial model [l5| . 

The probabilities irk may be different depending on 
our believe, e.g. all states are equally probable. Differ- 
ent investigators may have different believes, resulting in 
different choices of 7Tfc. The main assumption is that the 
prior distribution should be independent of what is our 



specific believe and, therefore, should be invariant under 
a believe transformation. 

Believe transformation: Let us represent by Sk the 
state fc, and let P(Sk\E) and P(Sk\E') be the probabili- 
ties to observe state Sk in one trial according to believe 
E and E', respectively. From Bayes rule it follows that 



P(S k \E') 



P(E'\S k ,E)P(S k \E) 
^ j P(E'\S j ,E)P(S j \E) 



(20) 



for k = 1, . 
as 



. , K. The latter equation can be rewritten 



7T t 



Ok 

A 



(21) 



for k = 1, . . . , K — 1 and n' K = 1 — X}fc<K TrjJ., where 
n k = P(S k \E), 7r' k = P(S k \E'), 



P(E'\S k ,E) 
P(E'\S K ,E) 



and 



.4 



1 + J2 (ajk - l)7T fc 



(22) 



(23) 



k<K 



Equation (|21[) provides the transformation rules of the 
probabilities 7r fc from one system of believe to another. 

The invariance under the above transformation lead to 
the functional equation 



P{TT')dir' = P{n)dir , 



(24) 



To solve this equation we first need to compute the de- 
terminant of the transformation Jacobian. The Jacobian 
of the transformation (1211) has the matrix elements 



7 - d -< 



~A~ 



a l (a j - Vj-Kj 
A 2 



(25) 



i,j = 1, . . . , K — 1. This matrix can be decomposed into 
the product J = BC, where Bij = aidij/A is a diagonal 
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matrix and Cjj = £y — (aj — l)m/A has two eigenvalues, 
Ai = A^ 1 and an— 2-degenerate eigenvalue A 2 = 1. 
Putting all together we obtain 

ij| = i£|AiAr 2 = -^ll a *' ( 26 ) 

fc=l 

The solution of l|24[). with d7r' = | J|c?7r, is given by 

if 

P(tt) = const. Jl^ 1 • ( 27 ) 

4 = 1 

Note that for K = 2, 7Ti = p and 7r 2 = 1 — p, we recover 
the result by Jaynes for the binomial model 

P^ocp-^l-p)- 1 . (28) 

C. Improper priors renormalization 

The prior distributions (fT5| and (|2"T|) are improper, i.e. 
their integral over the parameter space is not finite. At 
first this may sound an unsuitable property for a prior 
distribution. Nevertheless, the improper nature of these 
prior distributions is just indicating that the symmetries 
in our problem are not sufficient to fully determine them. 
Data is required to obtain a proper distribution. The 
best example for an intuitive understanding of these ar- 
guments is the prior distribution of the location parame- 
ter. In the absence of any data and under the assumption 
of translational invariance, it is clear that every value in 
the real line is an equally probable value for the location 
parameter, resulting in an improper prior. 

From the operational point of view, the posterior 
distribution may be proper even when the prior is 
not. Indeed, the integral J d<pP(<p) may be improper, 
/ d(j)P{<j)\D) oc / d(j)P{D\^)p({(j)) may be proper. The 
posterior distribution can be improper when the infer- 
ence problem has not been correctly formulated or there 
is not sufficient data to determine the model parameters. 

To avoid dealing with improper distributions, we can 
renormalize improper priors to some limit of a proper 
distribution. Since conjugate priors facilitate analytical 
calculations they are a good starting point. This is il- 
lustrated in Table (|IV[) for selected examples. These are 
the prior distributions used herein. In particular, for the 
multinomial probabilities 7r and k we use the renormal- 
ized invariant priors 

v ; i=i 

with 7 — > and e; — > 0. 



V. MEAN-FIELD APPROXIMATION 

In this section we specify the form of the variational 
function Q(<j>). To allow for an analytical solution we ne- 
glect correlations between the group assignments and the 
remaining model parameters. We denote by pik the prob- 
ability that sample i belongs to sample group k and by 
qji the probability that probe j belongs to probe group 
I. Furthermore, given that 9, -k and n always appear 
in different factors in (|12[) or ([TJJ then their join distri- 
bution factorizes. Within the mean-field approximation 
for the group assignments and the later factorization the 
variational function can be written as 

Q(<t>) = l[pi gi R(0)RW (3i) 

i 

when dealing with the generative model (fT2"j) and 

Q{4>) = Hqj Cj R( e ) R ^) R ( K ) (32) 

i 3 

when dealing with the generative model (|14p . where R(x) 
denotes a generic probability density function of x. 

Summarizing, in the case studies below, we are going 
to solve the generative models (IT2"j) or ([14")) , making use 
of renormalized invariant priors (Table IIV[) and the MF 
variational function pTj) or (|52"|) , respectively. This ap- 
proach is based on the assumptions that: the population 
is divided in groups, the group assignments are gener- 
ated by a multinomial model, the priors are renormal- 
ized invariant distributions, and a MF approximation of 
the variational solution with respect to the group assign- 
ments. 



VI. CASE STUDY: CLUSTERING REAL VALUE 
DATA 

Quite often we deal with datasets consisting of a real 
value measurement Xij over i — 1 , . . . , n samples and 
j = 1, . . . , m variables, where the samples and variables 
are not necessarily independent. For simplicity, the par- 
ticular kind of dependency we focus on is the existence of 
sample and variable groups. Our problem is to infer the 
sample and variable groups and the statistical parame- 
ters characterizing them. 

To address this problem we consider the generative 
model (fl4|) with a normal likelihood, representing a two- 
sides Gaussian mixture model. The two-sides Gaussian 
mixture model is a natural extension of the Gaussian 
mixture model [l|, to characterize datasets with a 
group structure for both the samples and variables. Our 
contributions in this context are the use of prior distri- 
butions derived from symmetry arguments alone and the 
inclusion of a group structure at the variables side. The 
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dataset, likelihood and priors associated with our statis- Minimizing (|36|) with respect to pu, qji, R(/i,a), R(tt) 

tical model are defined as follows: and R(k) we obtain (VB-1): 

Data: Consider i = 1, ... ,71 samples, j — 1, . . . , m 
variables, and the real value measurements Xy. 2 

Likelihood: We assume that are random variables ( ln7r fc)-^j Z^-i ?ji ( ■^-+(^ii-<Mfci>) 2 J 
with a normal distribution, with group dependent mean p ik = — (37) 



Hg iCj and group independent variance a, resulting in the „ ( ln7I "s}~^r E)^ in ^-afj+PQj— (Msi>) 

likelihood ^ s 6 



P(X\g,c,fi,a)=Y[ 



V2na 2 



(33) 



Here we are assuming that the main difference between 
groups is given by the means while the variance is group 
independent. The latter is a good approximation when 
the source of noise is given by the measurement itself 
and it behaves the same independently of the sample and 
variable group. 

Priors: For the prior P(/i, a) we generalize the Normal 
distribution prior in Table [TV] Accounting for more than 
one location parameter we obtain 



P{n,a) 



r(|) o*+i 



e~ a ^ (34) 
e -^(^i-A fc ;) 2 (35) 



and we work in the limit a — * 0. 

To apply the variational method we consider the MF 
approximation (f32|) . Substituting the likelihood (|33j) . the 
priors ![29|). ([30]) and (J34J) and the MF variational function 
(|32[) into ([5|), and integrating over <f> (summing over gi 
and Cj and integrating over fi^i, cr, irk and k;) we obtain 



F < const. + (nm + KL + a + l)(lna) 



ijkl 



a 

+ 2 



y\ ^2 / + \ CT 2 ■ 



dudaR(fi, a) In i?(/x, cr) 
d7ri?(7r)lni?(vr) + / dnR(n)\n R(k)k 

Hi 



R{fj.,a) 



r(f)a«+i' 



n 

A:/ 



27ra 2 



«fci = 5 + y^pzfcgji 



a = a + nm 



1:1 



1 



a + PikQjl 



a + nm 



^Pikqji (Xfj - (fiki) 2 ) 

ijkl 



(38) 



(39) 



(40) 



(41) 



(42) 



(43) 



E(tt) = D(tt; 7) , 7fc = 7fc + J]p jfc (44) 



R(k) — D(k; e) , e; = q + 9jl (45) 



F* = const. + y^^Pik lnp t fc + ^ gji lngji - In 6(7) 



lnB(e) + i^ln 



£*JW ■ 



(46) 



A7 



^ Pife lnpifc + g,-; In 



(36) 



ik 



These are a set of self-consistent equations which can be 
solved recursively to determine the probabilistic group 
assignments and the /i, a, n and k distributions. They 
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are the same in spirit as those for the EM algorithm [2j. 
Following 0, H3] we refer to them as variational Bayes 
(VB) algorithm. 

The main difference between the EM and VB algo- 
rithms is that in the former case we would take the aver- 
age of the log likelihood over the group assignments but 
not over the distributions of /i, a, n and n. By taking 
the average over fj, and a we obtain the additional 1 /a^i 
term within the parenthesis in equations ((37)) and (f38|) . 
According to (|40[) is equal to a plus the product of the 
average number of samples in sample group k Q^jPife) 
and the average number of variables in variable group I 
GZj Iji)- Therefore, the 1/au term penalizes assignments 
to small size groups. And it balances the contribution of 
(Xij — (fiki)) 2 , which drives the estimates towards a bet- 
ter fit and consequently groups of minimal size. 

A. VB implementation, real value data 

The actual implementation of the VB-1 algorithm in 
the context of real value data proceeds as follows. Set 
sufficiently large values for K and L, larger than our 
expectation for the actual values of K and L. In the 
following test examples we use K = L = 20. Set the pa- 
rameters a, p,ki, <7, 7^ and ei. We set a = = h = 10~ 6 , 
jlki — and cr = l. The choice of p,ki and a is practically 
irrelevant provided we have chosen a sufficiently small 
a. Set random initial conditions for pik and qji. Start- 
ing from these random initial conditions iterate equations 
(f3"T)) - (|46[) until the solution converges up to some prede- 
fined accuracy. We use relative error of F* smaller than 
10~ 6 . In practice, compute (flu), ^kl, c*, Ik, (hi7r fc ), e t , 
(hiK;), pik, qji and F* in that order. To explore different 
potential local minima use different initial conditions and 
select the solution with lowest F*. Since this algorithm 
penalizes groups with few members it turns out that, for 
sufficiently large K and L, some sample and condition 
groups result empty. If this is not the case K and/or L 
should be increased until at least one sample group and 
one variable group results empty. 

B. Test examples 




()i , 1 , 1 , 1 , 1 

u 1 2 3 4 

c 

FIG. 1: Clustering real value data: Mutual informa- 
tion I — I(p°,p*) between the original p° and estimated p* 
groups assignments, relative to its maximum value To when 
p* — p° . The original data was made of n = 100 samples 
divided in K groups and m = 100 conditions divided in L 
groups. The values of Xij were extracted from a normal dis- 
tribution with mean fi^i — k + I and variance a. The figure 
shows the mutual information between the original groups 
and the group assignment, estimated by the VB-1 algorithm, 
as a function of the variance a. The dashed-dotted, solid and 
dashed lines corresponds with the worst, average and best 
case on 100 test examples, respectively. In a) K — L = 2 and 
in b) K = L = 4. In both cases the mutual information is 
approximately equal to its maximum Iq for values of a less 
than one, the minimum difference between the original means 



To test the performance of the VB-1 algorithm, ([37]) - 
fl4"6]) . we consider test examples generated by the likeli- 
hood (f3"o) itself. Our aim is to test the variational result 
in the context of a relatively small number of samples 
and conditions. To quantify the goodness of the group 
assignment we consider the mutual information between 
the original p° (pf k = S 9i k) and estimated p* sample 
group assignments, 



Pkk' 



P?kPik 



(48) 



(49) 



I(p°,P*) 



where 



kk' 



Pkk' In 



Pkk' 
PkPl' 



(47) 



Pk 



n £ — ' 



P l k 



(50) 
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Note that I(p°,p*) takes its maximum value when p* 
it ' . denoted by I — I(p°,p°). Off course, the same 
could be done for the condition group assignments as 
well. 

In our test examples the original data was made of 
n = 100 samples divided in K groups and m = 100 con- 
ditions divided in L groups. The values of Xij were ex- 
tracted from a normal distribution with mean jiijy = k + l 
and variance a. We estimate the group assignment us- 
ing the VB-1 algorithm, sampling one initial condition. 
Figure [T] shows the mutual information between the orig- 
inal and estimated groups as a function of the variance 
(7. In a) K = L = 2 and in b) K = L = 4. In both 
cases the mutual information is approximately equal to 
its maximum Jo for values of a less than 1. Since 1 is 
the minimum difference between the original means hm , 
we conclude that the VB-1 algorithm performs well when 
there is a significant difference between the distributions 
associated with different groups. For larger values of a 
the VB-1 algorithm performance starts to decrease. This 
is not, however, a deficiency of the algorithm but an un- 
avoidable consequence of the mixing between the distri- 
butions coming from different groups. It is worth notic- 
ing that we obtain similar results for the case K = 4 and 
L = 1, indicating that the method works when there is no 
group structure on one side, in this case the conditions. 



VII. CASE STUDY: CLUSTERING DATA 
REPRESENTED BY HYPERGRAPHS AND 
BIPARTITE GRAPHS 

There are several datasets consisting of a certain num- 
ber of properties and the information of whether or not 
each sample exhibits each of the properties. For example, 
the dataset in Fig. [5] describes a population of three ani- 
mals characterized by two attributes, hair and legs. The 
attribute hair can take the value YES (has hair) or NO 
(does not have hair) while the attribute legs takes the val- 
ues 2 or 4 (at least within this dataset). The mathemat- 
ical treatment of this problem is significantly simplified 
if the variables are mapped onto Boolean variables. To 
each S states variable we associate S Boolean variables, 
each representing the occurrence or not of a specific let- 
ter of the alphabet. For example, the attribute hair is 
associated with hair- YES and hair-NO and the attribute 
legs with legs-2 and legs-4 (Fig. [2p). The outcome of 
this mapping is represented by the Boolean matrix a^, 
taking the value 1 if the answer to the Boolean variable 
j is YES on sample i and otherwise. 

Depending on our aim, the Boolean matrix can be rep- 
resented either by a hypergraph or a bipartite graph. 
When we aim to cluster the samples without attempting 
to cluster the Boolean variables, ajj is better interpreted 
as the adjacency matrix of a hypergraph. A hypergraph 
is an intuitive extension of the concept of graph to al- 
low for connections between more than two elements. In 
our case, the hypergraph vertices represent samples and 



a) 



'3 w 
3 £ 



b) 



1 deer 


YES 


2 gorilla 


YES 


3 duck 


NO 


(4) 


(2) 


©\ 






yv) 



hair: 



legs: 



hair-YES(l) 
hair-NO (2) 
legs-2 (3) 
legs-4 (4) 




FIG. 2: Hypergraph and bipartite graph data repre- 
sentations: a) An example of a problem with categorical 
data, b) Mapping of the categorical variables onto augmented 
Boolean variables, c) Hypergraph representation of the cate- 
gorical dataset in a), d) Bipartite graph representation of the 
categorical dataset in a), e) A graph example, f) Nearest- 
neighbor mapping of the graph in e) onto a hypergraph, where 
each hyper-edge represents a set of nearest neighbors of a ver- 
tex in the original graph, indicated by (1), (2), (3) and (4). g) 
Nearest-neighbor mapping of the graph in e) onto a bipartite 
graph. The original graph vertices are represented by 1, 2, 3 
and 4. The augmented bipartite graph vertices, representing 
nearest-neighbor sets, are represented by (1), (2), (3) and (4). 



hyper-edges, one associated which each Boolean variable, 
represent the set of all samples with the answer YES to 
the corresponding Boolean variable (Fig. On the 

other hand, when we aim to cluster both the samples 
and Boolean variables then a bipartite graph interpreta- 
tion is more appropriate, with one class of vertices for 
the samples and another one for the Boolean variables, 
and an edge connecting sample i and variable j when- 
ever ctij = 1 (HJi). The differences between these two 
approaches will become clear below. 



A. One side clustering: Statistical model on 
hypergraphs 

In this case the samples are assumed to be divided in 
groups while the hypergraph edges are modeled as inde- 
pendent. Here we follow the statistical model introduced 
in [13: 
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Data: Consider a hypergraph with a vertex set repre- 
senting n samples and m edges characterizing the rela- 
tionships among them. The hypergraph is specified by 
its adjacency matrix a, where — 1 if element i belongs 
to edge j and it is otherwise. 

Likelihood: The adjacency matrix elements are gen- 
erated by a binomial model with sample group and 
variable dependent probabilities 9 k j, k = 1,...,K and 
j = 1, . . . , m, resulting in 

P(a\g,6) = Y[02(l-9 gd ) 1 - a - , (51) 

ij 

Priors: As priors we use the renormalized invariant 
prior of the binomial model (Table [TV)) . Taking into ac- 
count that we have a binomial model for each pair of 
sample group and edge, we obtain 

P{6) = J] Beta(0 fcj ; a kj J kj ) (52) 

kj 

with &kj — * and (3 — * 0. 

Substitute the likelihood (JHU), the priors and ([52]) . 
and the MF variational function (|31 [) into ([3]), and inte- 
grating over <p (summing over gi and integrating over 9 k i 
and n k ) we obtain 



F < ( ^Pikdij + a kj - 1 (\n9 kj ) 

jk V i J 

- E (l> fc(1 ~ + ^ ~ 1 ) (ln(1 ~ 

jk \ i / 

+ J2 Vik Inpik + J d9R(9) In R{9) 

ik 

+ J dTTR(Tr) In R(tt) + const. (53) 

Minimizing ([53")) with respect to pu, R(9) and -R(7r) we 
obtain (VB-2) 



i?(vr) = D(tt; 7 ) , 7 fc = 7fc + ^ftt (58) 



F* = const. + y^Kfc \np ik 

ik 

- ^lnBK,,/3 fej )-lnB( 7 ) (59) 

These equations represent the VB algorithm for the sta- 
tistical model on hypergraphs. In this case we have not 
been able to disentangle the contributions weighting the 
fit to the data and the model bias, both being included 
in the averages (ln(9 k j)) and (ln(l — 9 k j)). 

B. VB algorithm implementation, statistical model 
on hypergraphs 

The implementation of the VB algorithm for the statis- 
tical model on hypergraphs proceeds as follows. Set suffi- 
ciently large values for K, larger than our expectation for 
the actual values of K. We use K = 20 in the following 
test examples. Set the parameters a k j, f3 k j and j k . We 
set the parameters akj = f3 k j = 7fe = 10 -6 . Set random 
initial conditions for pik . Starting from these initial con- 
ditions iterate equations (f54|) - (f59|) until the solution con- 
verges up to some predefined accuracy. We use relative 
error of F* smaller than 1CP 6 . In practice, compute a k j, 
Pkj, (In^), (ln(l - 9 kj )), 7fc, (In. 7^), p ik and F* in that 
order. To explore different potential local minima use 
different initial conditions and select the solution with 
lowest F* . Since this algorithm penalizes groups with 
few members it turns out that, for sufficiently large K, 
some sample and condition groups result empty. If this 
is not the case then increase K until at least one group is 
empty. A matlab code implementing this algorithm can 
be found at |http: / /www.sns.ias.edu^vazquez/hgc.htnu] 

1. Test example: zoo problem 



_ e ^)+T.^ii^AH^ii)(M-i--e ki ))\ Consider the animal population in Fig. Hi together 

Ptk J2 e < ln7r s >+5Dj[ a ij< lne sj>+( 1 - a i3)< ln ( 1 - 6 'sj)>] ^ ' with their attributes: habitat, nutrition behavior, etc. 

Figure [5Jd shows the mapping of this dataset onto a hy- 
pergraph. The hypergraph vertices represent animals 
R{9) = JjB(0fcj;a;y,/3iy) , (55) an d the edges represent the association between all an- 

k j imals with a given attribute: edge 1, all non-airborne 

animals; edge 2, all airborne animals, and so on. 

The animal population stratification was already ad- 
a kj = a kj + ^2p ik a i: j (56) dressed in [l7j], finding the solution in Fig. [3b. Although 

y the starting statistical model is the same, the solution in 

[l7| was found assuming fixed the number of groups and 
estimating the group assignment using the EM algorithm 
[3 k j = {3 k j y^p ik (l — djj) . (57) (essentially a maximum likelihood estimate). Then, in an 

' an attempt to focus in the solution with better consensus, 
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Group 1 antelope buffalo calf cavy deer elephanf fruitbat giraffe goat gorilla hamster hare oryx pony reindeer squirrel vampire vole wallaby 
Group 2aardvark bear boar cheetah leopard lion lynx mole mongoose opossum polecat puma pussycat raccoon wolf 
Group 3dolphin mink platypus porpoise seal sealion 

Group 4chicken crow dove duck flamingo gull hawk kiwi lark ostrich parakeet penguin pheasant rhea skimmer skua sparrow swan vulture wren 
Group 5bass carp catfish chub dogfish haddock herring pike piranha seahorse seasnake sole stingray tuna 
Group 6frog newt pitviper slowworm toad tortoise tuatara 
Group 7flea gnat honeybee housefly ladybird moth slug termite wasp worm 
Group 8clam crab crayfish lobster octopus scorpion seawasp starfish 

Group 1 aardvark antelope bear boar buffalo calf cheetah deer elephant giraffe goat leopard lion lynx mink mole mongoose opossum oryx polecat 

pony puma pussycat raccoon reindeer wolf 
Group 2 cavy fruitbat gorilla hamster hare squirrel vampire vole wallaby 
Group 3 dolphin porpoise seal sealion 

Group 4 chicken crow dove duck flamingo gull hawk kiwi lark ostrich parakeet penguin pheasant rhea skimmer skua sparrow swan vulture wren 
Group 5 bass carp catfish chub dogfish haddock herring pike piranha seahorse seasnake sole stingray tuna 
Group 6 frog newt pitviper slowworm toad tuatara 
Group 7 platypus tortoise 

Group 8 flea gnat honeybee housefly ladybird moth slug termite wasp worm 
Group 9 scorpion 

Group 10clam crab crayfish lobster octopus seawasp starfish 

Group 1 antelope buffalo calf deer elephant giraffe goat gorilla oryx pony reindeer wallaby 6 

Group 2aardvark bear boar cheetah leopard lion lynx mongoose polecat puma pussycat raccoon wolf 

Group 3cavy fruitbat hamster hare squirrel vampire vole 

Group 4mink mole opossum 

Group 5dolphin platypus porpoise seal sealion 

Group 6chicken crow dove duck flamingo gull hawk kiwi lark ostrich parakeet penguin pheasant rhea skimmer skua sparrow swan vulture wren 

Group 7bass carp catfish chub dogfish haddock herring pike piranha seahorse sole stingray tuna 

Group 8newt pitviper slowworm tuatara 

Group 9gnat honeybee housefly ladybird moth scorpion wasp 

Group 1 0flea slug termite tortoise worm 

Group 1 1 clam crab crayfish frog lobster octopus seasnake seawasp starfish toad 

Group 1backbone-NO venomous-YES legs-6 legs-8 legs-5 tail-NO domestic-YES f 

Group 2hair-YES eggs-NO milk-YES legs-4 

Group 3aquatic-YES breathes-NO fins-YES legs-NO 

Group 4feathers-YES airborne-YES toothed-NO legs-2 

Group 5catsize-YES 

Group 6predator-YES toothed-YES 

Group 7hair-NO eggs-YES milk-NO catsize-NO 

Group 8feathers-NO airborne-NO backbone-YES venomous-NO tail-YES domestic-NO 
Group 9aquatic-NO breathes-YES fins-NO 

FIG. 3: Stratification or an animal population: a) A list of animals is given together with certain attributes characterizing 
them. The complete dataset is available from [18||. Except for the attribute - legs - one and zero indicate possession or not, 
respectively, of the corresponding attribute. The problem consist on determining the optimal stratification of the animal 
population based on the provided attributes, b) Hypergraph representing the zoo data. Each line corresponds with an edge, 
whose elements are specified within the right column, c) Stratification as obtained in [TJ. d) Stratification by the VB-2 
algorithm, e) and f) Stratification of the animal population e) and Boolean variables f) by the VB-3 algorithm. 



solutions for different number of groups were obtained 
and the most representative solution was selected. 

Here we address the same problem using a Bayesian 
approach and the variational solution. We start from 



the same statistical model on hypergraphs but now ob- 
tain a solution using the VB-2 algorithm (|54|) -(f59 |l . sam- 
pling 10,000 initial conditions as in The solution 
found by the VB-2 algorithm (Fig. [SJi) is quite similar 
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to that previously found in 17] (Fig. [3k). The main 
differences are the splitting of the terrestrial mammals, 
the exclusion of the platypus and the tortoise from the 
amphibia-reptiles group and the scorpion from the terres- 
trial arthropods. More important, in both cases the main 
groups represent terrestrial mammals, aquatic mammals, 
birds, fishes, amphibia-reptiles, terrestrial arthropods 
and aquatic arthropods. The VB-2 (f54"f - (153p algorithm 
represents, however, a significant improvement over the 
approach followed in [l7| ■ It finds the consensus solution 
in one run, because it has built in the balance between 
better fitting and less bias. 



2. Test example: finding network modules 

The work by Newman and Leicht [19( provides a hint 
on how to apply the hypergraph clustering to the problem 
of finding modules or communities in a graph or network. 
A graph is made by a set of vertices and a set of edges, 
the latter being pairs of connected vertices. The idea of 
Leicht and Newman is a "guilty by association" principle: 
vertices between the same module of a graph will tend to 
have connections to the same other vertices. This prob- 
lem can be translated to a hypergraph problem, where 
the vertices are the graphs vertices, the hyper-edges are 
the set of nearest neighbors and the Boolean variables 
characterize whether or not a vertex belongs to the a set 
of nearest neighbors [13] (Fig. [2^ and f). More precisely, 
to each vertex we associate a hyper-edge, given by the 
set of its nearest neighbors. Therefore, there are m = n 
hyper-edges, one for every vertex in the original graph. 
The hypergraph adjacency matrix has the matrix element 
fly = 1 if vertex i belongs to hyper-edge j, i.e. if ver- 
tex i belongs to the nearest-neighbor set of vertex j, and 
dy = otherwise. If we label the nearest-neighbor sets 
with the same label as the vertices then the hypergraph 
adjacency matrix coincides with the adjacency matrix of 
the original graph. Thus, there is an exact mapping from 
the statistical model proposed by Newman and Leicht 
[l^ ] to the statistical model on hypergraphs. 

Having specified this mapping we use the VB-2 algo- 
rithm ([54]) - ([59]) , sampling one initial condition, to find 
the graph modules in the original graph. To illustrate its 
performance we consider as a case study a graph com- 
posed by two communities, with probabilities p\ and p2 
that two vertices within the same or different communi- 
ties are connected, respectively. As already anticipated 
by Newman and Leicht [l9j . the nearest-neighbor ap- 
proach can resolve both dense communities with lesser 
inter-community connections (pi 3> P2) and sparse com- 
munities with more inter-community connections (pi -C 
P2)- Figure 2] shows that the VB-2 algorithm performs 
quite well in those two regimes. 




0.2 ' 0.4 ' (16 ' 0.S 




FIG. 4: Finding graph modules, hypergraph model: 

Mutual information I — I(p°,p*) between the original p° 
and estimated p* groups assignments, relative to its maxi- 
mum value 7o when p* = p° . The original data was made of 
a graph with n = 100 vertices divided in K = 2 groups, with 
an intra- and inter-community connection probabilities p\ and 
pi, respectively. The figure shows the mutual information, be- 
tween the original groups and the group assignment estimated 
by the VB-2 algorithm (I54|) - (|59| ). as a function of the inter- 
community connectivity p2- The dashed-dotted, solid and 
dashed lines corresponds with the worst, average and best 
case on 100 test examples. In a) we deal with dense commu- 
nities (pi = 0.9) and the algorithm performs well (I/Io ~ 1) 
for small values of the inter-community connectivity proba- 
bility p2- In b) we deal with sparse communities (pi = 0.1) 
and the algorithm performs well for large values of the inter- 
community connectivity probability p2- 



C. Two sides clustering: statistical model on 
bipartite graphs 

We can face situations where there are groups of 
Boolean variables as well, requiring the clustering of both 
samples and Boolean variables. In this case the bipartite 
graph representation is more appropriate, with a class of 
vertices representing the samples and a class of vertices 
representing the Boolean variables. More precisely, 

Data: Consider a bipartite graph with two vertex sub- 
sets, representing n samples and m Boolean variables. 



12 



The graph is specified by its adjacency matrix a, where 
a.y = 1 when sample i is connected to Boolean variable j, 
i.e. if Boolean variable j is true for sample i, and = 
otherwise. 

Likelihood: The adjacency matrix elements are gener- 
ated by a binomial model with sample group and vari- 
able group dependent probabilities 9 k i, k = 1, . . . , K and 
I = 1, . . . , L, resulting in 



e <Kl>+£i fc Pifc[«ii (ln0 fc! )+(l-aij)<lii(l-0 fci ))] 



^ e («s)+2:,fcP l fc[a.j<l«9fe s > + (l-aij)<ln(l-''fc s ))] 



R{6)= l[B{9 k i;a k i,p k 



(65) 



P(a|5,c,( 



9i c j 



(60) 



Pikqji&ij 



(66) 



Priors: For P(0) we use the renormalized invariant 
prior of the binomial model. Taking into account that 
we have one binomial model per each pair of sample and 
variable group we obtain 



(67) 



Pfa) = IJB(0 W ; a W)/ 9«) 
kj 



(61) 



with otki — > and /3fe — > 0. 

The likelihood (f6"0"| is quite similar to (f5Tj) . the main 
difference being that now the statistical properties of the 
Boolean variables appear through their corresponding 
group assignments Cj. This increases the model com- 
plexity by considering a group structure for the Boolean 
variables and, at the same time, reduces the number of 9 
parameters. Furthermore, (|60|) contains (fSTjl as the par- 
ticular case where L — n and one group associated to 
each Boolean variable. 

Substituting the likelihood flgUJ), the priors flST]), ([23) 
and (|30|). and the MF variational function ([32]) in 
and integrating over (f> (summing over gi and Cj and in- 
tegrating over 9hi, ^k and m) we obtain 



^Pife lnp 4fc + ^ ^ In qji 



ik 



d9R(9) In R(9)+ J dirR^) lni?(?r) 
dKR(K) In R(k) + const. 



(62) 



Minimizing (|6"2"|) with respect to py, qjt, R(9), R(tt) and 
fl(/e) we obtain (VB-3) 



,<T*>+Eji?ii[o«(ii»e*,>+(i-oy)<in(i-e fc o>] 



Pa- 



<^ s )+E,-,9j![ai 3 <lne s i) + (l-a lj )(ln(l-e ai )>] 



(63) 



i?(vr) = D(tt; 7 ) , 7fc = 7fe + ^p ik (68) 



P(tt) =D(k;c) , e,=ei+5^5,-j (69) 



F* = const. + ^ pj fe lnp ife + ^ q^j In 

ife jl 

- ^lnB(a fci ,/3 fci )-lnB( 7 )-lnB(e) (70) 

kl 

Equations (|63|) - (f70|) represent the VB algorithm for the 
statistical model on bipartite graphs. They can be used 
to found modules or communities in graphs with a bipar- 
tite structure, including those representing samples and 
Boolean variables. 



D. VB algorithm implementation, statistical model 
on bipartite graphs 

The implementation of the VB-2 algorithm (I63|) - (T70|) 
for the statistical model on bipartite graphs proceeds as 
follows. Set sufficiently large values for K and L, larger 
than our expectation for the actual values of K and L. 
Set the parameters a k i, $kh Ik and ej. We set the pa- 
rameters a k i — (3m — 7 fc = li = 10~ 6 . Set random initial 
conditions for pi k and qji. Starting from these initial 
conditions iterate equations (|63|) - ([70|) until the solution 
converges up to some predefined accuracy. We use rela- 
tive error of F* smaller than 10~ 6 . In practice, compute 
a k j, Pkj, (ln6» fej ), (ln(l - 0jy)}, Ik, (hi7r fe ), ej, (Ihkj), 
Pik, Qji and F* in that order. To explore different po- 
tential local minima use different initial conditions and 
select the solution with lowest F*. Since this algorithm 
penalizes groups with few members it turns out that, for 
sufficiently large K and L some sample and/or variable 
groups result empty. If this is not the case, increase K 
and/or L until at least one sample group and one variable 
group results empty. 
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1. Test example: zoo problem 

Let us go back to the zoo problem (Fig. Now we 
represent this dataset by a bipartitite graph, with one 
class of vertices representing the animals and the other 
class the Boolean variables (e.g. Fig. [2k,, b and d) Us- 
ing the VB-3 algorithm ([63 ]) -([70 | . sampling 10,000 initial 
conditions as in [13], we perform a two-sides clustering 
of the bipartite graph obtaining the animal population 
stratification in Fig. [3^ and the Boolean variables strat- 
ification in Fig. [3f. The animal clusters are similar to 
those previously obtained using the statistical model on 
hypergraphs (Fig. and d). The main difference is 
the more refined subdivision of terrestrial mammals, now 
split in four groups (1, 2, 3 and 4). 

In addition to the animal population stratification the 
two-sides clustering provides association groups between 
the Boolean variables (Fig. |3J). These associations re- 
flect the fact that not all Boolean variables are indepen- 
dent, some of them are linked. For example, group 2 clus- 
ter four typical attributes of terrestrial mammals, they 
have hair, do not put eggs, milk and have four legs. In the 
same way, group 3 clusters attributes of fishes and group 
four of birds. Thus, in general, the bipartite graph model 
and the resulting two-sides clustering provides more in- 
formation than the hypergraph approach. 

2. Test example: finding network modules 

The bipartite graph model can be use to find network 
modules as well. In this case one class of vertices repre- 
sents the original graph vertices and the other represents 
sets of nearest neighbors (Fig. [2g). The two-sides clus- 
tering thus attempts to cluster both the original graph 
vertices and the sets of nearest neighbors. When the 
original graph is undirected the problem is symmetric 
(e.g. see Fig. [2V,). Indeed, if vertex i belongs to the 
nearest-neighbor set of vertex j then vertex j belongs to 
the nearest- neighbor set of vertex i. As a consequence 
the clustering on the original vertices side cannot be dif- 
ferentiated from the clustering of nearest-neighbor sets. 
Intuitively this means that when two vertices belong to 
the same graph module we can say that their nearest- 
neighbor sets belong to the same nearest-neighbor set 
group. 

Having specified this mapping we use the VB-3 algo- 
rithm (|6"3")) - ([T0|) , sampling one initial condition, to find 
the graph modules in the original graph. To illustrate 
its performance we consider once again a graph com- 
posed by two communities, with probabilities p\ and P2 
that two vertices within the same or different commu- 
nities are connected, respectively. Figure O shows that 
the VB-3 algorithm can resolve both dense communities 
with lesser inter-community connections (pi ~3> P2) and 
sparse communities with more inter-community connec- 
tions (pi <C P2)- 

The comparison of Fig. [5] and 0] indicates that the 
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FIG. 5: Finding graph modules, bipartite model: Mu- 
tual information / = I(p°,p*) between the original p° and 
estimated p* groups assignments, relative to its maximum 
value 7o when p* = p° . The original data was made of a 
graph with n = 100 vertices divided in K — 2 groups, with 
an intra and inter-community connection probabilities pi and 
P2, respectively. The figure shows the mutual information, 
between the original groups and the group assignment esti- 
mate by the VEM-3 algorithm (I63I) -(I70 | ), as a function of the 
inter-community connectivity p2. The dashed-dotted, solid 
and dashed lines corresponds with the worst, average and best 
case on 100 test examples. In a) we deal with dense commu- 
nities (pi = 0.9) and the algorithm performs well (I /Io ~ 1) 
for small values of the inter-community connectivity proba- 
bility p2- In b) we deal with sparse communities (pi = 0.1) 
and the algorithm performs well for large values of the inter- 
community connectivity probability p2. 



bipartite graph model performs slightly better than the 
hypergraph model. For example, focusing on the average 
performance, for pi = 0.9 the VB-3 algorithm performs 
almost perfectly till P2 — 0.6, while the VB-2 algorithm 
does till p-2 = 0.5. This could be, however, specific to the 
tested set of examples. Further research is required to 
determine which version performs better depending on 
the dataset under consideration. 
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VIII. DISCUSSION AND CONCLUSIONS 

The Bayesian approach allows for a systematic solution 
of data analysis problems. Its starting point is a statis- 
tical model of the data under consideration. From there, 
using Bayes rule, we can invert the statistical model to 
obtain the posterior distribution of the model parame- 
ters. The latter can be use, in principle, to calculate or 
compute averages or other magnitudes of interest. 

One of the main criticisms to the Bayesian approach 
is the apparent ambiguity in selecting the prior distribu- 
tions. Here we have worked further on Jaynes method 
[l5l |. claiming that the prior distributions are given by 
the most general distribution dictated by the symme- 
tries of the problem under consideration. One undesired 
consequence of this method is that when the symmetries 
are not sufficient constraints we obtain improper prior 
distributions. Yet, the use of improper priors can be 
avoided by working with renormalized distributions that 
are proper, and approach the improper prior in a certain 
limit. Using this approach we report here a correction to 
Jaynes prior for a likelihood with translation and scale 
invariance and a generalization of Jaynes prior for the 
binomial model to the multinomial model. 

Having resolve the issue about the prior distributions, 
we can proceed to the application of the Bayesian ap- 
proach to resolve a population structured. Taking in- 
spiration from mixture models [l[, in particular Dirich- 
let mixture models [l3[, we introduce general statisti- 
cal models with a built in population structure at the 
sample, and sample and variable, level. The model with 
a structure at the sample level aims one-side clustering 
problems, where the variables are assumed to be indepen- 
dent measurements. The model with a structure at both 
sample and variable level aims two-side clustering prob- 
lems, where there are classes of variables. These statis- 
tical models are then postulated as generative models of 
some dataset. Introducing a MF approximation as varia- 
tional function, wc then resolve the population structure 
by solving the inverse problem, i.e. determining the sam- 
ple and/or variable groups and model parameters from 
the data. 

To illustrate the applicability and systematicity of the 
variational method, here we study the problem of data 
clustering, in the context of real value and Boolean vari- 
ables. The outcome is a variational Bayes (VB) algo- 
rithm, a self-consistent set of equations to determine the 
group assignments and the model parameters. The VB 
algorithm is based on recursive equations similar to those 
for the EM algorithm, but with some intrinsic penaliza- 



tion for model bias. In the case of real value data, and 
under the assumption of normal distributions, the con- 
tributions favoring fitting and penalizing model bias are 
clearly disentangled. The fitting is quantified, as it is ex- 
pected for normally distributed variables, by the mean 
square deviation. The model bias is quantified by the 
inverse of the square root of the mean cluster sizes. The 
tendency to reduce the mean square deviation is thus 
balanced by a tendency to increase the cluster sizes. 

In the case of Boolean variables our analysis is based on 
a mapping into a hypergraph or bipartite graph. When 
we cluster the samples but not the Boolean variables the 
problem is mapped onto a statistical model on hyper- 
graphs [17j . On the other hand, when we perform a 
two-side clustering, clustering both the samples and the 
Boolean variables, the problem is mapped onto a statis- 
tical model on bipartite graphs. 

The VB algorithms associated with the statistical 
model on hypergraphs and bipartite graphs can be used 
to find modules on a graph. Starting on an idea by 
Newman and Leicht [19j |. we show that the problem 
of graph modules can be mapped onto the problem of 
finding hypergraph modules or bipartite graph modules, 
where the hypergraph edges and the augmented bipar- 
tite graph vertices represent nearest-neighbor sets in the 
original graph. The resulting VB algorithms represent 
a significant improvement over the maximum likelihood 
approaches followed in [l!| and [l7j . by including a self- 
consistent correction for model complexity and bias. 

It is worth mentioning that, depending on the starting 
statistical model, we could arrive to different versions of 
the VB algorithm. Indeed, for the finding graph modules 
problem we could use both the hypergraph and bipar- 
tite graph models. Furthermore, Hofman and Wiggins 
[Tij have obtained another version based on a statistical 
model with different intra and inter-community connec- 
tion probabilities. These approaches differ in the defini- 
tion of what constitutes a group, community or module. 
We use the definition by Newman and Leicht [l!| based 
on topological similarity, i.e. two vertixes are topolog- 
ically identical if they are connected to the same other 
vertices in the graph. Thus, we obtain group of vertices 
whose patterns of connectivity are similar. On the other 
hand, the definition used by Hofman and Wiggins 14] is 
based on the existence of two edge densities, character- 
izing the tendency of having an edge between intra- and 
inter-group pairs of vertices. Depending on the prob- 
lem and the question we are asking we may adopt one or 
the other definition, and use the corresponding clustering 
method. 
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